International Journal of Computer Science Engineering 
and Information Technology Research (IJCSEITR) 

ISSN (P): 2249-6831; ISSN (E): 2249-7943 
Vol. 7, Issue 4, Aug 2017, 67-70 
© TJPRC Pvt. Ltd. 

REVIEW PAPER ON DECISION TREE DATA MINING ALGORITHMS TO IMPROVE 
ACCURACY IN IDENTIFYING CLASSIFIED INSTANCES USING LARGE DATASET 

GURPREET SINGH 1 & ER. RAJWINDERKAUR 2 

1 Professor, Department of Computer Science & Engineering, St. Soldier Institute of 
Engineering & Technology, Jalandhar Punjab, India 
2 M. Tech Scholar, Department of Computer Science & Engineering, St. Soldier Institute of 
Engineering & Technology, Jalandhar, Punjab, India 

ABSTRACT 

This paper reviews a classifier that combines the CART distance-based algorithm with the classification tree paradigm 
based on the C4.5 algorithm. The CART algorithm is used as a preprocessing step to obtain a modified training database 
for the subsequent learning of the classification tree structure. The incorrectly classified instances are then duplicated 
and added to the previous data set, and finally C4.5 is applied to complete the classification of biomedical data. 

KEYWORDS: The Hierarchical Model of Decisions. Data Mining is a Technology that Draws Out Information from Colossal 
Amount of Gigantic Data and Remolds it into a Human Understandable Form 

INTRODUCTION 

Decision trees use a hierarchical model of decisions and their consequences. The structure of a 
decision tree includes branches, a root node, internal nodes and leaf nodes. Each internal node denotes a test on an 
attribute, each branch denotes an outcome of the test, and each leaf node holds a class label. The topmost node is the 
root node of the tree. The tree is learned by dividing the source set into subsets based on a test of an attribute value. 
Data mining is a technology that draws out information from large amounts of data and remolds it into a 
human-understandable form. Other terms with a similar meaning include knowledge mining from data and 
knowledge extraction. 
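
The structure described above can be sketched in a few lines of Python. This is only an illustration, not code from the reviewed work: the `Node` class, the `classify` helper and the attributes `income` and `has_guarantor` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a decision tree.

    An internal node tests `attribute` and has one branch per test outcome;
    a leaf node has no attribute and holds a class `label` instead.
    """
    attribute: Optional[str] = None
    branches: Dict[str, "Node"] = field(default_factory=dict)
    label: Optional[str] = None

    def is_leaf(self) -> bool:
        return self.attribute is None


def classify(node: Node, instance: Dict[str, str]) -> Optional[str]:
    """Start at the root and follow the branch for each test outcome to a leaf."""
    while not node.is_leaf():
        node = node.branches[instance[node.attribute]]
    return node.label


# Hypothetical tree: the root tests "income"; one branch leads to a second test.
tree = Node(
    attribute="income",
    branches={
        "high": Node(label="good"),
        "low": Node(
            attribute="has_guarantor",
            branches={"yes": Node(label="good"), "no": Node(label="bad")},
        ),
    },
)

print(classify(tree, {"income": "low", "has_guarantor": "no"}))
```

The root node here is the `income` test, the dictionary keys are the branches, and the nodes that carry only a `label` are the leaves.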

Review on 

• Enhanced decision tree algorithm that works on large-scale data. Such an algorithm can be built with 
split selection methods from the literature, including those of algorithms like C4.5 and CART. 

• Enhancing efficiency with a new classifier that combines the CART distance-based algorithm with the 
classification tree paradigm based on the C4.5 algorithm. 

• Reducing the sum of squared errors: the proposed algorithm gives a lower sum of squared errors than 
the CART and C4.5 classification algorithms, which means that the new algorithm is more accurate. 

• Enhancing the efficiency of decision tree construction: various pruning techniques have been proposed that 
can help improve decision tree construction. 

• Research methodology is the organized way to solve a research problem. It is a conceptual approach that 
describes how the researcher carries out the research. 

C4.5 ALGORITHM 

C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome 
the bias of information gain toward tests with many outcomes. It applies a kind of normalization to information gain 
using a “split information” value defined analogously with Info(D) as 


$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$

This value represents the potential information generated by splitting the training data set, D, into v partitions, 
corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples 
having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the 
information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as 


$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$
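
As a minimal sketch of the two quantities defined above, the following Python computes split information and gain ratio for one attribute. The function names and the toy data are assumptions made for illustration; this is not the C4.5 implementation itself, which also handles continuous attributes, missing values and pruning.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Info(D) when given class labels: -sum of p * log2(p) over distinct values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def split_info(values: Sequence[Hashable]) -> float:
    """Split information: the same formula applied to the partition sizes |Dj|/|D|."""
    return entropy(values)


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of the attribute divided by its split information."""
    total = len(labels)
    partitions: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Expected information still needed after partitioning D on the attribute.
    info_after = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_after
    denominator = split_info(values)
    # An attribute with a single outcome splits nothing; treat its ratio as 0.
    return gain / denominator if denominator > 0 else 0.0


labels = ["good", "good", "bad", "bad"]
two_way = ["a", "a", "b", "b"]      # two outcomes, each pure
four_way = ["a", "b", "c", "d"]     # one outcome per tuple, e.g. an identifier
print(gain_ratio(two_way, labels), gain_ratio(four_way, labels))
```

Both toy attributes have the same information gain, but the four-way split has a larger split information, so its gain ratio is lower. This is the normalization effect the text describes.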


The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D, a 
data partition or set of training tuples, as 


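$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where $p_i$ is the probability that a tuple in D belongs to class $C_i$, and the sum is computed over the m classes. 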
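
The Gini index (one minus the sum of squared class proportions) can be computed with a short Python sketch. The function name `gini` is an assumption for illustration; the class counts of 73 "good" and 25 "bad" instances are taken from the confusion matrices reported later in this paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Impurity of a partition: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


# Class distribution of the 98-instance data set used in the experiments.
dataset_labels = ["good"] * 73 + ["bad"] * 25

print(gini(dataset_labels))          # impurity before any split
print(gini(["good"] * 10))           # a pure partition has impurity 0
print(gini(["good", "bad"] * 5))     # an evenly mixed binary partition has impurity 0.5
```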

Classification & Regression Trees (CART) 

ID3 and CART were invented independently of one another at around the same time, yet follow a similar approach for 
learning decision trees from training tuples. These two cornerstone algorithms spawned a flurry of work on decision tree 
induction. 

Decision tree induction can be adapted so as to predict continuous (ordered) values, rather than class labels. There 
are two main types of trees for prediction—regression trees and model trees. Regression trees were proposed as a 
component of the CART learning system. (Recall that the acronym CART stands for Classification and Regression Trees.) 
Each regression tree leaf stores a continuous-valued prediction, which is actually the average value of the predicted 
attribute for the training tuples that reach the leaf. Since the terms “regression” and “numeric prediction” are used 
synonymously in statistics, the resulting trees were called “regression trees,” even though they did not use any regression 
equations. By contrast, in model trees, each leaf holds a regression model—a multivariate linear equation for the predicted 
attribute. Regression and model trees tend to be more accurate than linear regression when the data are not represented well 
by a simple linear model. 
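
The difference between the two leaf types can be illustrated with a small Python sketch. The function names, the target values and the coefficients are hypothetical and serve only to show what each kind of leaf stores.

```python
from statistics import mean
from typing import Sequence


def regression_leaf_prediction(targets: Sequence[float]) -> float:
    """A regression tree leaf predicts the average target of its training tuples."""
    return mean(targets)


def model_leaf_prediction(
    coefficients: Sequence[float], intercept: float, x: Sequence[float]
) -> float:
    """A model tree leaf evaluates a multivariate linear equation at the new tuple."""
    return intercept + sum(c * xi for c, xi in zip(coefficients, x))


# Hypothetical target values of the training tuples that reached one leaf.
leaf_targets = [2.0, 4.0, 6.0]
print(regression_leaf_prediction(leaf_targets))

# Hypothetical linear model stored in a model tree leaf, applied to one tuple.
print(model_leaf_prediction([0.5, 2.0], 1.0, [2.0, 3.0]))
```

The regression leaf returns the same constant for every tuple that reaches it, while the model leaf's prediction still varies with the tuple's attribute values.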

Time taken to build model: 0.03 seconds 

=== Evaluation on training set === 

=== Summary === 

Correctly Classified Instances          84          85.7143 % 
Incorrectly Classified Instances        14          14.2857 % 
Kappa statistic                          0.5393 
Mean absolute error                      0.2339 
Root mean squared error                  0.342 
Relative absolute error                 61.1713 % 
Root relative squared error             78.4536 % 
Total Number of Instances               98 

=== Detailed Accuracy By Class === 

TP Rate   FP Rate   Precision   Recall   F-Measure   Class 
1         0.56      0.839       1        0.913       good 
0.44      0         1           0.44     0.611       bad 

=== Confusion Matrix === 

  a  b   <-- classified as 
 73  0 |  a = good 
 14 11 |  b = bad 


Figure 1: Results of J45 Algorithm in Weka 


Time taken to build model: 0.03 seconds 

=== Evaluation on training set === 

=== Summary === 

Correctly Classified Instances          79          80.6122 % 
Incorrectly Classified Instances        19          19.3878 % 
Kappa statistic                          0.4381 
Mean absolute error                      0.3038 
Root mean squared error                  0.3898 
Relative absolute error                 79.4424 % 
Root relative squared error             89.4057 % 
Total Number of Instances               98 

=== Detailed Accuracy By Class === 

TP Rate   FP Rate   Precision   Recall   F-Measure   Class 
0.918     0.52      0.838       0.918    0.876       good 
0.48      0.082     0.667       0.48     0.558       bad 

=== Confusion Matrix === 

  a  b   <-- classified as 
 67  6 |  a = good 
 13 12 |  b = bad 

Figure 2: Results of Decision Stump Algorithm in Weka 
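
The accuracy, kappa and per-class figures in both outputs follow from the confusion matrices alone. The Python sketch below recomputes them as a cross-check. The function name `summarize` and the result layout are assumptions for illustration; only the matrix entries come from Figures 1 and 2. The error measures (mean absolute error and the like) need the per-instance class probabilities and cannot be derived this way.

```python
from typing import Dict, List


def summarize(matrix: List[List[int]], classes: List[str]) -> Dict[str, object]:
    """Derive summary statistics from a confusion matrix.

    matrix[i][j] is the number of class-i instances classified as class j.
    Assumes every class occurs and is predicted at least once.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]                        # actual counts
    col_totals = [sum(row[j] for row in matrix) for j in range(n)]   # predicted counts

    accuracy = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected by chance, used by Cohen's kappa.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        recall = tp / row_totals[k]                  # same as the TP rate
        precision = tp / col_totals[k]
        fp_rate = (col_totals[k] - tp) / (total - row_totals[k])
        per_class[name] = {
            "tp_rate": recall,
            "fp_rate": fp_rate,
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        }
    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}


j45 = summarize([[73, 0], [14, 11]], ["good", "bad"])      # Figure 1
stump = summarize([[67, 6], [13, 12]], ["good", "bad"])    # Figure 2

for name, result in (("J45", j45), ("Decision Stump", stump)):
    print(f"{name}: accuracy={result['accuracy']:.4%} kappa={result['kappa']:.4f}")
```

Rounded to four decimal places, the recomputed accuracy and kappa values agree with those printed in the two Weka summaries.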


CONCLUSIONS 

This comparative study found that J45 performs better than Decision Stump on the same data set and 
parameters: it has the lower error rate and the higher percentage of correctly classified instances 
(85.71 % versus 80.61 %). 